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Abstract 

We present a sparse linear system solver that is based on a multifrontal variant of Gaussian elimina¬ 
tion, and exploits low-rank approximation of the resulting dense frontal matrices. We use hierarchically 
semiseparable (HSS) matrices, which have low-rank off-diagonal blocks, to approximate the frontal matri¬ 
ces. For HSS matrix construction, a randomized sampling algorithm is used together with interpolative 
decompositions. The combination of the randomized compression with a fast ULV HSS factorization 
leads to a solver with lower computational complexity than the standard multifrontal method for many 
applications, resulting in speedups up to 7 fold for problems in our test suite. The implementation targets 
many-core systems by using task parallelism with dynamic runtime scheduling. Numerical experiments 
show performance improvements over state-of-the-art sparse direct solvers. The implementation achieves 
high performance and good scalability on a range of modern shared memory parallel systems, including 
the Intel® Xeon Phi (MIC). The code is part of a software package called STRUMPACK - STRUctured 
Matrices PACKage, which also has a distributed memory component for dense rank-structured matrices. 


1 Introduction 

Solving large linear systems efficiently on modern hardware is an important requirement for many engineering 
high performance computing codes. For a wide range of applications, like those using finite element, finite 
difference or finite volume discretizations of partial differential equations (PDEs), the resulting linear system 
is extremely sparse. Fast solution methods exploit this sparsity, but also arrange the computations in such a 
way that most of the computational work is done on smaller dense submatrices. The reason for this is that 
operations on dense matrices can be implemented very efficiently on modern hardware. The multifrontal 
method [231 SSj is an example of a sparse direct solver where most of the work is done on dense, so-called 
frontal matrices. Unfortunately, dense linear algebra operations, for instance LU decomposition, require 
0(N 3 ) operations, where N is the matrix dimension. In a multifrontal solver these dense operations end up 
being the bottleneck. However, it has been observed that for many applications the dense frontal matrices 
have some kind of low-rank structure 13 - 

In [55] . a rank-structured multifrontal method is presented in which the larger frontal matrices are 
approximated by hierarchically semiseparable (HSS) [50] matrices. For certain model problems, this leads to 
a solver, or preconditioner, with linear or almost linear complexity in the total number of degrees of freedom 
in the sparse linear system. Here, we present an efficient implementation of a slightly modified version of 
the algorithm presented in [55] . The algorithm in [55] handles only symmetric positive definite systems, 
while the code presented here targets general non-symmetric non-singular matrices. For HSS compression, a 
randomized sampling algorithm from [35] is used. Earlier HSS construction methods, see [5B], cost at least 
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0(N 1 2 ), whereas the randomized method in combination with a fast matrix-vector product has a linear or 
almost linear complexity, depending on the rank-structure of the frontal matrix. 

An important concept used in the randomized compression algorithm is the interpolative or skeleton 
decomposition [15] , Use of this decomposition leads to a special structure of the HSS generator matrices (see 
Eq. (1121) 1. The HSS factorization used in [551 for symmetric matrices, and in [57] for non-symmetric matrices, 
exploits this special structure in a so-called ULV-like decomposition. In this paper, the ULV decomposition 
from m is used. 

The HSS format is a subclass of a more general type of hierarchical rank-structured matrices called T~L- 
matrices m HSS matrices are similar to "W 2 -matrices, another subclass of FUmatrices, in the sense that 
both formats have the special property that the generators are hierarchically nested (see Eq. m for what 
this means for HSS). This is typically not the case in the more general "H, the block low-rank (BLR) [?], the 
sequentially semi-separable (SSS) [50] or the hierarchically off-diagonal low-rank (HODLR) 0] formats (all of 
which are H-nratrices). In HSS and HODLR only off-diagonal blocks are approximated as low-rank whereas 
H, H 2 and BLR allow more freedom in the partitioning. In [3], BLR is used to approximate dense frontal 
matrices in the multifrontal solver MUMPS [B] while in other recent work [8] HODLR has also been proposed 
to accelerate a multifrontal solver. Both HSS and HODLR use similar hierarchical off-diagonal partitioning, 
but HSS further exploits the hierarchically nested bases structure, which can lead to asymptotically faster 
factorization algorithm for some matrices. 

Furthermore, thanks to the randomized HSS construction, our solver is also fully structured (compared 
to partially structured approaches where only part of the frontal matrix is compressed, see m) and the 
larger frontal matrices are never explicitly formed as dense matrices. 

Achieving high performance on multi/many-core architectures can be challenging but it has been demon¬ 
strated by many authors now that dynamic scheduling of fine-grained tasks represented by a Directed Acyclic 
Graph (DAG) can lead to good performance for a range of codes. This approach was used successfully in 
the dense linear algebra libraries PLASMA and MAGMA [2] and more recently it has become clear that it is 
also a convenient and efficient strategy for sparse direct solvers. For instance, in [35] the PaStiX solver [30j is 
modified to use two different generic DAG schedulers (PaRSEC [TdQ and StarPU [9]). In [Q StarPU is used 
in a multifrontal QR solver. In [33) . OpenMP tasks are submitted recursively for work on different frontal 
matrices, while parallelism inside the frontal matrices is also exploited with OpenMP tasks but with a man¬ 
ual resolution of inter-task dependencies. The sparse Cholesky solver HSL_MA87 [3T] uses a custom DAG 
scheduler implemented in OpenMP. Just as sparse direct solvers, hierarchical matrix algorithms also benefit 
from task parallelism: Kriemann [M] uses a DAG to schedule fine-grained tasks to perform H-matrix LU 
factorization. Our implementation uses OpenMP task scheduling, but since most of the tasks are generated 
recursively, a DAG is never explicitly constructed. 

The main contribution of this work is the development of a robust and efficient code for the solution of 
general sparse linear systems, with a specific emphasis on systems from the discretization of PDEs. Our work 
addresses various implementation issues, the most important being the use of an adaptive HSS construction 
scheme (Section 12.2.11) . based on the randomized sampling method Rather than assuming that the 

maximum rank in the HSS submatrices is known a-priori, it is computed in an adaptive way during the HSS 
compression. Other implementation techniques such as fast extraction of elements from an HSS structure 
(Section 14.51) are also indispensable to make the code robust and usable as a black-box solver. The code 
achieves high performance and good scalability on a range of modern multi/many-core architectures like 
Intel® Xeon and Intel® Xeon Phi (MIC), due to runtime scheduled task parallelism, using OpenMlrQ. The 
exclusive use of task parallelism avoids expensive inter-thread synchronization and leads to a very scalable 
code. This is the first parallel algebraic sparse solver with fully structured HSS low-rank approximation. The 
code is made publicly available with a BSD license as part of a package called STRUMPACkH STRUctured 
Matrices PACKage. STRUMPACK also has a dense distributed memory component, see [47] . 

This work advances the field significantly on several fronts. 

• Wang et al. [52] presented the first parallel multifrontal code with HSS embedding, called Hsolver. 

1 http://openmp.org/wp/openmp-specifications/ 

■ http://portal.nersc.gov/proj ect/sparse/strumpack/ 
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However, two shortcomings prevent it from being widely adopted: it is based on discretization on 
regular meshes and is only partially structured due to the hurdle of the extend-add of HSS update 
matrices (see Section 0]). Our new code, on the other hand, is a purely algebraic solver for general 
sparse linear systems and is fully structured, mitigating the HSS extend-add obstacle thanks to the 
randomized sampling technique (see Section Id.3D . 

• Napov and Li developed a purely algebraic sparse solver with HSS embedding [42], but it is only 
sequential and their experiments did not include the randomized sampling HSS compression. Our new 
code is parallel and we show detailed results with randomized sampling. 

In future work, building on the current paper and on the distributed HSS code developed in [47], we intend 
to develop a distributed memory algebraic sparse solver with HSS compression. 

The rest of the paper is outlined as follows. Some required background on HSS is briefly presented in 
Section [2] First, in EU the HSS rank-structured format is described. Next, the fast randomized sampling 
HSS construction 3^1 and the ULV decomposition m are discussed in Sections cu and 12.31 respectively. 
Section [3] describes multifrontal LU decomposition. Then, in Section [4] we discuss how HSS matrices can 
be incorporated into a multifrontal solver. Section [5] explains various aspects of the actual implementation. 
In Section [G] we present experimental results that illustrate numerical and performance aspects of the code. 
Finally, Section 0 has some concluding remarks and an outlook to planned future work. 


2 HSS: Hierarchically Semi-Separable matrices 

This section briefly introduces hierarchically semi-separable (HSS) matrices, mostly following the notation 
from [39]. HSS is a data-sparse matrix representation which is part of the more general class of "H-matrices 
and more specifically 77 2 -matrices. 


2.1 Overview of the HSS matrix format 


The following notation is used: is matlab-like notation for all indices in the range, * denotes complex 

conjugation, #I T is the number of elements in index set I T = {*i, i 2 , • • ■ , i n } and R T = R(I T ,:) is the matrix 
consisting of only the rows I T of matrix R. 

Consider a square matrix A € C NxN with an index set Ia = { 1,..., N} associated with it. Let T be a 
postordered binary tree, meaning that children in the tree are numbered before their parent. Each node r 
of the tree is associated with a contiguous subset t T C I. For two siblings in the tree, zq and v 2 , children of 
r, it holds that t Vl U t V2 = t T and t Vl H t V2 = 0. Furthermore, U T= i ea f( 7 -)t T = i roo t(T) = 1 a• The same tree 
T is used for the rows and the columns of A and only diagonal blocks are partitioned. An example of the 
resulting matrix partitioning is given in Figure flal and the corresponding tree is shown in Figure [Tb] 

The diagonal blocks of A, denoted D r , are stored as dense matrices in the leaves r of the tree T 


D t — A(I t , I T ). 


(1) 


The off-diagonal blocks A„ ltVa = A{I Vl , I V2 ), where zq and zq denote two siblings in the tree, are factored 
(approximately) as 

A Vl , V2 nU^B^ 2 (V^)* . ( 2 ) 

The matrices U^' g and V^ lg , which form bases for the column and row spaces of A Vlt „ 2 , are typically tall and 
skinny, with f7 blg having #I Vl rows and (column-rank) columns, V^ lg has #/y 2 rows and r° 2 (row-rank) 
columns and hence B Vl ^ V2 is r J Vi x r(j 2 . The HSS-rank r of matrix A is defined as the maximum of r r T and 
over all off-diagonal blocks, where typically r -C N. The matrices B Vl)V2 and B V2tUl are stored in the parent 
of v i and v 2 . For a non-leaf node r with children zq and zq, the basis matrices U^ lg and V^ lg are not stored 
directly since they can be represented hierarchically as 
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(a) HSS partitioning of a square matrix. (b) Tree corresponding to the HSS partition. 

Figure 1: Illustration of an HSS partitioning of a square matrix. Diagonal blocks are partitioned recursively. Figure (b) shows 
the tree, using postordering, corresponding to the partitioning in (a) and it illustrates the basis matrices stored in the nodes of 
the tree. 


Note that for a leaf node t/^ lg = U T and V^ 1S = V T . Hence, every node r with children v 3 and v 2l except for 
the root node, keeps matrices U T and V T . The example from Figure |Tj can be written out explicitly as 



Dx 

UxB 

i, 2 F 2 * 



r hB 2 ,xV{ 

d 2 



0 is 

is 0 

1_1 

u 6 b 6 , 3 v 3 * 

1 _ 1 

l-1 


'Ux O' 

0 u 2 

u 3 B 3 M v e ; 

vi 

0 

0 ' 
^5*. 


Da 

UaB 

4 , 5 ^* 



u 5 b 5A vi 

d 5 




(4) 


The storage requirements for an HSS matrix are 0(rN). Construction of the HSS generators will be 
discussed in the next section. Once an HSS representation of a matrix is available, it can be used to 
perform matrix-vector multiplication in 0(rN) operations compared to 0(N 2 ) for classical dense matrix- 
vector multiplication, see pH j47]. 


2.2 Fast HSS construction through randomized sampling 

In [39] Martinsson presents a randomized sampling algorithm for the efficient construction of an HSS rep¬ 
resentation of a matrix A. Note that the same technique was also used by Xia et al. in [571155] for HSS 
compression in a multifrontal solver. The main advantage of this approach is that it does not need ex¬ 
plicit access to all elements of A, but only needs a fast matrix-vector routine and selected elements from 
A. The matrix A never needs to be formed explicitly as a dense matrix and this allows to save memory. 
The overall complexity of the algorithm is 0(Nr 2 ), with r the HSS-rank of A, provided that a fast ( 0{N )) 
matrix-vector product is available. By comparison, other approaches based on direct low-rank compression of 
matrix off-diagonal blocks typically require (D(N 2 r ) operations. This section briefly presents the randomized 
compression algorithm, for a more in depth discussion see mm- 

Suppose the HSS-rank r is known a priori and R G C Nxd is a tall and skinny random matrix with d = r+p 
columns where p is a small oversampling parameter. Let S r = AR and S c = A*R be samples for the row 
(superscript r ) and column bases (superscript c ) of A respectively. Algorithm [T| with R r = R c = R computes 
the HSS representation of A using the information available in the samples S r and S c by hierarchically 
compressing (using interpolative decompositions, see below) the off-diagonal blocks of A, starting from the 
leaves. 

Let D t for a non-leaf node r with children v\ and v- 2 be defined as 


D t 


n A 
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If {n,T 2 ,... ,T g } are all the nodes on level £ of the HSS tree, then 

D& = dmg(D T 1 ,D T 2 ,...,D Tq ) ( 6 ) 

is an N x N block diagonal matrix. The main idea of the randomized sampling Algorithm |T] is to construct 
a sample matrix for each level of the tree as 

S w = {A- R = S r - D&R . (7) 

This sample matrix S™' captures the action of a product of the block off-diagonal part of A with a set of 
random vectors R. It is exactly this block off-diagonal part that needs to be compressed using low-rank 
approximation. 

Another crucial component of the randomized sampling algorithm is the interpolative decomposition 
(ID) [T?51. The ID computes a factorization of a rank-fc matrix Y € C mx " by expressing Y as a linear 
combination of a set J of k selected columns of Y: 


[X, J] = ID(F), such that Y = Y(:,J)X, Y(:,J) eC raxl , X £ C kxn , (8) 

or it can be modified to take a compression tolerance e, such that 

[X, J] = ID(y, e), s.t .Y = Y(:,J)X + 0(e), Y(\, J) e C mxk ', XeC fc ' xn , (9) 


where k 1 < k is the numerical rank. The ID can be computed from a rank-revealing or column pivoted QR 
decomposition jT6l [45] 

YU = Q [R ± R 2 ] , (10) 

where R\ is upper-triangular, followed by a triangular solve such that 


Y = (QRi) ([/ r^r 2 ] n- 1 ) = y(:, J)X . 


( 11 ) 


A consequence of using the ID in Algorithm [l] is that B Vl ^ = A(/^, J£ 2 ) 
matrix A. Furthermore, it also leads to a special structure for the U T and V T 


is a submatrix of the original 
generators: 


U r = II 


r 

T 



and 


V T =Ul 



( 12 ) 


referred to as interpolative bases, which can be exploited in the computations. Note that these interpola¬ 
tive bases are not orthonormal. Although creating orthonormal bases might slightly improve stability, the 
interpolative structure improves performance of the compression algorithm and the ULV decomposition, see 
Section 12.31 


2.2.1 Adaptive scheme to determine the HSS-rank 

In practice however, the HSS-rank of the matrix is not known in advance. In this case, Algorithm [T] can 
be called repeatedly while increasing the number of columns of i?, S r and S c . As long as d < r + p, the 
ID in line [9] will fail. Suppose the ID fails at node r, i.e., the required accuracy e is not reached, but the 
descendants of node r are successfully compressed. In that case, during the next iteration of Algorithm |T| 
with d t— d + Ad , it is not necessary to redo the compression (ID) or the extraction of D and B for the 
descendants of node r. However, those descendants do have to update the Ad new columns in i? r / c (lines fl2l 
and fl5l) and S r ^ c (lines [U [8] and [Toll . In [47] , this adaptive rank scheme is presented in more detail. 
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Algorithm 1: Computing the HSS factorization of a nonsymmetric matrix. 


i Function Ah ss = HSSCompress (R r , R c , S r , S c , £, r = root(Ah ss )) 

Data: S r = AR r and S c = A*R C with {S r , S c , R r , R c } G R Nxd , d > r max + p 
Result: dh ss : D T (leaves), B„ 1V2 , B V2 V1 (non-leaves), U T , V T (all except root). 
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foreach v E child(r) do HSSCompressC R r , R c , S r , S c , u) 
if child(r) = 0 then 
D t = A(I t , I T ) 

S r T = S r (I T ,:) - D T R r (I r , :) S* = S c (I r ,:) - D* T R C (I T ,:) 


else // v\ and are the children of node 
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2.2.2 Implementation issues 

The random matrices i? r and I? c are filled element by element using a pseudo-random number generator. 
Our implementation offers the minstd_rand and mtl9937 generators from the C++11 standard while the 
distribution can be either uniform over [0,1) or standard normal (Gaussian) Af(0, 1). By default the linear 
congruential engine^ minstd_rand is selected in combination with the Gaussian distribution. 

The rank-revealing QR factorization, used in the ID, could be replaced by a strong rank-revealing QR 
factorization [28], with possibly greater accuracy and smaller HSS-rank. This is left as future work. Two 
interesting alternative approaches to the randomized compression routine discussed in this section should be 
mentioned, namely adaptive cross approximation [10] and a matrix-free approach presented in [37] . 


2.3 ULV-like factorization and solve 


Solving a linear system with an HSS matrix can be done by first computing a so-called ULV decomposi¬ 
tion [18l . where U and V* are unitary matrices and L is lower triangular. However, in [55] and [57] , the ULV 
decomposition is modified to take advantage of the special structure of the U T and V T generators, see m- 
The resulting algorithm is referred to as ULV-like since it is no longer based on unitary transformations. 

In the first step of a ULV factorization, zeros are introduced in the HSS block rows. This step can be 
done using for instance a full QL factorization 


'o' 

, tlZUi = 

'O' 

Ui 

7 T * 

Ui 


(13) 


However, thanks to the special structure of U T , a multiplication from the left with a carefully chosen fl T is 
much cheaper and has a similar effect 


n T 




3 This choice is motivated further in section HI 
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We refer to m for a detailed description of the ULV factorization and the corresponding solve. 


3 Multifrontal sparse LU factorization 

This section briefly recalls the main ingredients of the multifrontal method for the LU factorization of general 
invertible sparse matrices. For a more detailed discussion of multifrontal methods, see [2.‘11 138] . The method 
casts the factorization of a sparse matrix into a series of partial factorizations of many smaller dense matrices 
and Schur complement updates. 


3.1 Matrix reordering 

As a preprocessing step, A is first scaled and permuted for numerical stability: A <— D r AD c Q c , where D r 
and D c are diagonal matrices that scale the rows and columns of A and Q c is a column permutation that 
places large entries on the diagonal. We use the MC64 code by Duff and Koster [22] to perform the scaling 
and column permutation. Popular alternative scaling algorithms can be found in |48., 7l [20]. After that, a 
fill-reducing permutation A PAP T is applied in order to reduce the number of nonzero elements in the 
LU factors. Permutation matrix P is computed using nested dissection applied to the adjacency graph of 
A + A T , using one of the graph partitioning tools SCOTCH jH] or METIS [52] . Instead of nested dissection, 
other heuristics like AMD [5] can be used. 

The multifrontal method relies on a structure called the elimination tree. The elimination tree serves 
as a task and data-dependency graph for both the factorization and the solution process. A few equivalent 
definitions of the elimination tree are available. We use the following, and we recommend the survey by 
Liu [38j for more detail on the method and the survey by L’Excellent for more detail about implementation 
issues like parallelism, memory usage, numerical aspects etc. [55] . 

Definition 1. Assume A = LU, where A is an NxN sparse, structurally symmetric matrix. The elimination 
tree of A is a tree with N nodes, where the i-th node corresponds to the i-th column of L and with the parent 
relations defined by: parent(j) = min{i : i > j and ^ 0}, for j = 1,..., N — 1. 

In practice, nodes are amalgamated: nodes that represent columns and rows of the factors with similar 
structures are grouped together in a single node. For instance when using nested dissection reordering, all 
vertices from the same graph separator can be grouped in one elimination tree node. In the end, each node 
corresponds to a square dense matrix, referred to as a frontal matrix , with the following 2x2 block structure: 


An A12 

A21 F22 


(15) 


3.2 Numerical factorization 


Multifrontal factorization of the matrix consists in a bottom-up traversal of the tree, following a topological 
order (a node is processed before its parent). Processing a node means first forming (or assembling ) the frontal 
matrix followed by elimination of the fully-summed variables in the An block and finally a Schur complement 
update step. The frontal matrix A: is formed by summing the rows and columns of A corresponding to the 
variables in the An, A 21 and Fj 2 blocks, with the temporary data - the extended update matrices - that 
have been produced by the children of i after their elimination step, i.e., 


A i = Ai 


U v = 


v £ child(i) 


A(ir,iD 

i A(/" pd ,/r p ) 


A( 7 se P)/ u pd ) 




(16) 


where A = {I^ ep , /j Upd } is the set of row and column indices of A,; w.r.t. the global matrix A, after reordering. 
Eliminating the fully-summed variables in the An block is done through a partial factorization of A,;, typically 
via a standard dense matrix factorization of the An block. Next, the Schur complement (contribution block 
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or update matrix) is computed as Ui = F 22 — F 2 iF^ 1 1 F 12 and stored in temporary memory. In contrast to the 
elimination step which uses straightforward dense matrix operations (high performance LAPACK/BLAS3 
codes), the assembly step (1161) requires index manipulation and indirect addressing while summing up 74- 


For example, if two children’s update matrices Uk = 


bk 

Cfc dk 


, k = y-[. V 2 , have subscript sets /} pd = {1,2} 


and /^ lpd = {1,3}, respectively, then those update matrices can only be added after aligning the index sets 
of the two matrices by padding with zero entries 


U 1 ^U 2 =U 1 + U 2 
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(17) 


This summation operation is called extend-add, denoted by -<{>. The relationship between frontal matrices 
and update matrices can be revealed by F-i = Ai <{> U V2 where nodes zq, 1 / 2 ,..., v q are 

the children of i. 

Each partial factorization might involve pivoting within the frontal matrix. It can also happen that no 
suitable pivot can be found during a step of partial factorization. In this situation, the corresponding row 
and column remain unfactored and are sent to the parent node. This strategy is used for instance in the 
MUMPS [5] code. Currently our code does not perform any such delayed pivoting, but instead relies on 
static pivoting (using MC64) and partial pivoting during the LU decomposition of the Fn blocks. 


3.3 Solution 

Once the factors are computed, the solution x of Ax = b is computed in two steps: forward solution by doing 
a triangular solution with the L factor and backward substitution by doing a triangular solution with the 
U factor. The forward solution step is a bottom-up topological traversal of the elimination tree, while the 
backward substitution is a top-down traversal. 


4 Multifrontal solver with HSS frontal matrices 

This section explains how a multifrontal solver, see Section^ can be used in combination with the HSS data- 
structures and algorithms from Section[2]to improve the computational complexity and storage requirements. 
This section closely follows [55] . 

4.1 Selection of HSS frontal matrices 

Note that the largest frontal matrices, those that determine the computational complexity of the solver, 
typically correspond to nodes closer to the root of the elimination tree. Let the top of the tree, i.e., the 
root node, be at level £ = 0 of the tree. Then, define a switch-level i s such that the frontal matrices at 
levels l > £ s of the elimination tree are stored as regular dense matrices whereas those at levels £ < £ s 
are compressed using the HSS format. According to the analysis in [5SJ, £ s should be chosen such that the 
factorization cost above and below the switch-level are equal. However, this rule is not very practical and 
experiments show that performance depends crucially on the choice of £ s . 

4.2 Separator reordering 

Apart from the scaling and permutation of A for stability, and nested dissection reordering to reduce fill- 
in, an additional reordering is applied to the index set of each separator. This reordering is needed to 
obtain favorable HSS rank structure in the corresponding frontal matrices. It is computed by recursively 
bisecting the graph of the separator in subgraphs of size approximately b (defaults to b = 128), using a 
graph partitioning tool (SCOTCH or METIS). Each partition then corresponds to a leaf in the HSS tree 
of the corresponding frontal matrix. However, since a separator graph can be disconnected, it is enriched 












R 2 R -3 


Figure 2: Illustration of the extend-merge procedure for the random vectors. Node 3 needs 3 random vectors. It can get 
elements 7*4,1 and 7*5,1 from either child 1 or 2 . Elements 7-7,1, 7*7,2, 7*4,2 and 7*5,2 are copied from child 2 . Elements 7*6,1 and 7*6,2 
in R% are generated with the properly seeded pseudo random number generator. When the adaptive HSS compression scheme 
decides that a third column has to be added to R%, those elements are also generated. 


with length-two connections from the connectivity graph before it is passed to the partitioner, see also the 
discussion in |32] . Note that other reorderings can be used instead of nested dissection. The influence of the 
reordering on the ranks of off-diagonal blocks is studied in [53] . 

4.3 Skinny extend-add 

From here on, we assume that a binary elimination tree is used. The steps followed for each HSS frontal 
matrix Ti are as follows. First, a random matrix Ri e C# /iXdi is constructed. If the children zq and v 2 of i 
are also HSS, then Ri is constructed as follows 

( R Vl (r, c) = i?„ 2 (r, c) if c < min (d Ul , d„ 2 ), Ii(r) e /" pd and A(r) <E J“ pd , 

R .( rc )= J^i( r ’ c ) if c < d Vl and h(r) G J“ pd , 

I R„ 2 (r,c) if c < d V2 and /,(r) G /“ p '\ 

( random(r, c) otherwise 

The random matrices of the children are merged in the parent Ri and any elements not present in any of 
the children’s R are generated. This extend-merge procedure is illustrated in Figure [2j If node i has no 
(HSS) children, Ri is generated. However, it is important that corresponding “random” entries in R U1 and 
R V2 are equal, since that allows efficient evaluation of S\ = TiRi (similarly for T*Ri) based on 

TiRi = ( A, 4 U V1 4 . U V2 ) Ri = (. AA ) 4 (U Vl Ri{I*f ,:)) 4 (^ 2 i? J (/" pd ,:)) , (19) 

where i?i(/“ pd J : ) denotes the subset of rows of Ri which are also in /“ pd and 4 denotes an extend-add 
operation where the extend is only done for the rows, not the columns. By the construction of Ri m, 
the first d Ul columns of i?i(/)f pd i 0 are already available at node zq, which is convenient for the evaluation 
of U Vl /i’,;(/“ i pd .:). Evaluation of U Vl Ri(I^f A ,:) is discussed in more detail in Section l4~4l When generating 
rows in Ri, the random generator is seeded for each row using the global row index Ii{r), to ensure that 
Ri is consistent with its sibling. This frequent seeding is the reason the linear congruential pseudo-random 
engine minstd_rand was chosen as default over for instance the mt 19937 Mersenne-Twister, which has a 
much bigger internal state. 

The frontal matrices T, with level(i) < l s are completely approximated by HSS and are never explicitly 
formed as a dense matrix. This in contrast to earlier, so-called partially structured approaches where for 
instance only the F u or the F 2 i, F n and F\ 2 blocks are compressed [52]. Partially structured approaches 
typically at one point or another form a dense representation of the F 22 block, perform the Schur complement 
update on it, and then use this dense update matrix in the extend-add procedure. This is done to avoid having 
to perform an overly complicated extend-add operation on HSS matrices. However, the approach followed 
here does not require first assembling a dense frontal matrix before doing HSS compression. This is due to 
the use of the randomized HSS compression, Algorithm [I] which only requires matrix-vector multiplication 
and extraction of selected elements from the frontal matrix. 
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When R i} S] and Sf have been constructed, HSS compression using Algorithm [T] can be performed. 
However, when di+p is less than the HSS-rank of T % . Algorithm |T] will fail. In that case, columns are added 
to Ri, i.e., di += Ad (Ad = 128 by default), the new columns of SJ and Sf are computed and Algorithm [T| 
is called again, this time only updating the new columns of Ri, S / and Sf. Due to the use of the ID in 
Algorithm [I] HSS generators D T and B Vl ^ 2 are submatrices of J-). Hence, a routine to extract specific 
elements from F, is required. This routine will be described in Section [4.51 

4.4 ULV factorization and low-rank Schur complement update 

After HSS compression, a factorization of F ill is performed: classical row-pivoted LU if F, is dense, ULV 
if it is HSS. For a dense frontal matrix, Ui = F i22 — F i21 F~^F il2 is computed explicitly. In the HSS case, 
F l22 is kept in HSS form and the update F i21 F~^F il2 = ©*<P; is stored as a low-rank product. Expressions 
for 0* and are derived and presented in detail in [55] for symmetric and in m for non-symmetric 
matrices. Given Ui = F i22 — 0*4^, the multiplication with Ui in (flT)l) can be performed efficiently using HSS 
matrix-vector multiplication for F i22 and two dense (rectangular) matrix products for 0* and 4>j. 

4.5 Extracting elements from an HSS matrix 

Finally, extracting elements from Fi requires extracting elements from an HSS matrix. In |55j a routine 
is presented for extracting multiple elements from an HSS matrix while trying to minimize the number of 
traversals through the HSS tree. We use a conceptually simpler algorithm based on the HSS matrix-vector 
multiplication. By multiplying an HSS matrix with unit vectors, selected columns can be extracted. At the 
leaf nodes, instead of multiplying with a unit vector, one can simply select the proper columns of V*. Unlike 
for matrix-vector multiplication, during element extraction parts of the tree traversal can be pruned. 

4.6 Preconditioning versus iterative refinement 

Direct solvers often use a few steps of iterative refinement to improve the solution quality [S3]. However, 
the multifrontal method with HSS compression as presented in this paper is used as a preconditioner for 
GMRES instead. For the same number of multifrontal solve steps (preconditioner applications), a Krylov 
solver typically leads to smaller residuals than iterative refinement. This is particularly useful when the HSS 
compression tolerance is increased, since in that case the HSS-multifrontal method is no longer an exact 
direct solver and the number of outer iterations increases. 

4.7 Solver complexity 

The computational complexity of a standard multifrontal solver is typically dominated by the dense linear 
algebra corresponding to the few largest frontal matrices. For instance a nested dissection reordering on a 
d-dinrensional mesh with N = k d vertices has a top separator with 0(k d ~ l ) vertices, leading to an overall 
complexity of 0(fc 3 ( d_1 )), i.e., 0(N 3 / 2 ) and 0(N 2 ) for 2D and 3D meshes respectively. 

For the HSS-embedded multifrontal solver, the complexity is dominated by the HSS compression of the 
dense frontal matrices, which in turn depends on the rank pattern. Earlier works by Chandrasekaran et 
al. [17] and Engquist & Ying [20 showed the rank patterns of the elliptic and the Helmholtz operators 
respectively. Xia showed complexities for the randomized HSS multifrontal solver assuming different rank 
patterns [55:. Combining the above results, we summarize the solver complexities for two types of PDEs 
and two sparse solvers, see Table [TJ 

5 Shared memory parallel implementation 

The algorithm presented in Section [4] has been implemented using C++ and OpenMP, targeting shared 
memory platforms. The code relies on BLAS, LAPACK, METIS and/or SCOTCH and a recent C++11 
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MF 

MF-HSS-RS 

problem 

HSS rank 

factor flops 

memory 

factor flops 

memory 

2D 

(k x k) 

elliptic 

0(1) 

0(JV 3/ 2) 

0(N log N) 

O(N) 

O(N) 

Helmholtz 

O(logfc) 

3D 

(k x k x k) 

elliptic 

O(k) 

0(N 2 ) 

0(N 4 / 3 ) 

0{N 10 / 9 log N) 

0{N) 

Helmholtz 

0{k) 


Table 1: Summary of the complexities of the standard multifrontal solver (MF) and the randomized HSS-embedded multifrontal 
solver (MF-HSS-RS) applied to two important classes of problems. The mesh size per side is k and the matrix dimensions are 
N = k 2 in 2D and N = k 3 in 3D. 


compliant compiler with support for OpenMP 3.1 or higher. The code makes heavy use of the OpenMP task 
construct. OpenMP was chosen because it is easy to use, performs well and is well documented and supported. 
However, alternatives like Intel® Threading Building Blocks [46] or Cilk(+) [TTj offer conceptually similar task 
parallelism. Switching to one of those should not be hard. While other runtime systems like QUARK 158], 
DAGuE/PaRSEC [14] and StarPU [9] (distributed memory task scheduling) and OmpSs [24] might have 
certain specific advantages over the OpenMP runtime, many of those innovations, for instance explicit 
modeling of task dependencies or task-offloading, are eventually incorporated in the OpenMP standard as 
well. 

OpenMP tasks are created and scheduled at runtime by the scheduler. Task schedulers typically use a 
work stealing [T2] or task stealing strategy to balance load between threads. Each thread/core has its own 
local queue of tasks. When a thread runs out of work it can steal a task from one of the other thread’s task 
queues. 

5.1 Task based tree parallelism 

Traversals of both the elimination tree and the HSS hierarchy allow for tree parallelism, i.e., independent sub¬ 
trees can be processed concurrently. For instance, multifrontal factorization requires bottom-up topological 
traversal of the elimination tree, just like HSS compression requires bottom-up traversal of the HSS hierarchy. 
The code in Listing [T| shows how to do a parallel bottom-up tree traversal using the OpenMP task construct. 
The tree is stored as objects of a class Tree with two members left_child and right_child, both pointers 
to subtrees, also objects of type Tree. In Listing [l] the variable depth keeps track of the recursion depth and 
no more tasks are generated after a certain depth to avoid excessive overhead of creating too fine-grained 
tasks. Experiments show that setting d_max to log 2 (#threads) + 3 leads to a good task granularity. With 
this setting, the maximum number of tasks at any given point in time is about 2 d - max = 8 • ^threads. This is 
enough to ensure good load balance and avoids excessive task creation overhead. OpenMP tasks supports an 
if clause, so the check if (depth<d_max) could have been put in the OpenMP pragma. However, optimizing 
the code to perform this check outside the directive completely avoids all task creation and synchronization 
overhead when it evaluates to false. The final (condition) clause informs the OpenMP runtime that the 
generated task will not generate more tasks if condition evaluates to true. Finally the untied clause in¬ 
forms the runtime that this task can be moved to a different thread when it encounters a scheduling point. 
For instance, when a task spawns a new task, the spawning task may be moved to another thread. Untied 
tasks allow for better load balance, whereas tied tasks (the default) typically lead to better data locality. 
The taskwait pragma ensures processing of the children is finished before continuing with the parent. 

5.2 Hybrid node and tree parallelism 

Exploiting tree parallelism alone as in Listing [1] does not scale well due to the limited degree of parallelism 
near the root. Although the HSS-multifrontal algorithm can exploit two nested levels of tree parallelism 
(elimination tree and HSS hierarchy), the scaling bottleneck remains. To overcome this, one needs to exploit 
parallelism in the computational work inside the tree nodes, which are mostly dense linear algebra opera¬ 
tions. However, work sharing constructs like OpenMP parallel for loops are not allowed within OpenMP 
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void Tree : : postorder (depth=0) { 
if (depth < d.max) { 
if (left_child) 

^pragma omp task untied default ( shared ) final (depth >= d_max — 1) mergeable 
left_child —>postorder (depth+1) 
if ( r ight _chi 1 d ) 

^pragma omp task untied default ( shared ) final (depth >= d_max — 1) mergeable 
right_child —>postorder (depth+1) 

^pragma omp taskwait 
} else { 

if (left.child) 1 e f t _c h i 1 d —>post order ( depth+1) 
if (right_child) right_child —>postorder (depth + 1) 

} 

do_stuff(depth); // factor/compress..., can generate more tasks 

} 

Listing 1: Bottom-up topological parallel tree traversal implemented with recursion and the OpenMP (3.1) 
task construct. 


tasks. Moreover, calling multithreaded BLAS or LAPACK routines from multiple tasks/threads leads to 
over-subscription and generally poor performance. This is because existing multithreaded BLAS/LAPACK 
libraries are optimized to use the entire machine. One possible strategy is to exploit tree parallelism only 
for the lower levels of the tree and switch to a sequential processing of the nodes higher up in the tree while 
switching to multithreaded linear algebra. However, this leads to many synchronization points and does not 
scale with increasing number of threads. Our approach on the other hand is to use task parallelism within 
the tree nodes as well to allow for a seamless transition between tree and node parallelism, since scheduling 
of tasks is left to the runtime system. When getting closer to the root node there is a shift from tree to node 
parallelism. This is illustrated in Figure [3] Even in the case of highly unbalanced trees, the runtime can 
assign work evenly to the available cores. We chose not to use an existing library for the task based dense 
linear algebra, for instance PLASMA (based on the QUARK runtime), since we wished to exploit the same 
threading mechanism (OpenMP) already used for the tree parallelism. 


5.3 Parallel BLAS and LAPACK 


One of the most time consuming operations of the algorithm is dense matrix-matrix multiplication C •<— 
aAB + /3C. This can be implemented easily with recursion and task parallelism [41 j . by splitting the 
problem into smaller matrix-matrix multiplications; this strategy is referred to as divide-and-conquer and is 
often used in so-called cache-oblivious algorithms [26]. How the matrices are split depends on their shapes. 
Let A be m x k and B be k x n, then 


C <- 


'aAB + pC 


a 

AB 0 

ABi 

+ P 

a 

~a 0 b 

AiB 

+ P 

<3 o 

j_ 

(AqBq + A\Bi) T 


C 0 C x 


if m x n x k < T, 
else if n > max(m, k), 

else if m > k, 

else 


( 20 ) 


The last case in (12011 . short fat A times tall skinny B, uses two consecutive recursive matrix-matrix multipli¬ 
cation calls. Cases 2 and 3 start two multiplications in parallel, spawning two tasks. The recursion ends when 
reaching case 1, with T a tuning parameter set by default to T = 64 3 , where a sequential vendor optimized 
BLAS3 *gemm routine is called. Depending on the scalar type, one of four inlined template specialization 
functions for gemm<scalar> is executed to pick the correct version: sgemm, dgemm, cgemm or zgemm. For the 
other BLAS2/3 routines that are required, for instance triangular matrix multiplication and solve, a similar 
recursive approach is used. This recursive task generation is also stopped when the recursion depth becomes 
too large, with the same depth parameter being passed through the entire code and incremented each time 
it enters a new task. 
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constant overall concurrency 



(a) Three nested levels of parallelism. 


(b) Constant overall concurrency. 


Figure 3: Schematic illustration of the different types of concurrency in the code and the gradual shift from tree parallelism to 
in-node parallelism, (a) Tasks for dense kernels (o) are nested in nodes (o) of the HSS trees, which are nested in the elimination 
tree nodes (□) (e-tree), (b) Left-to-right, top-to-bottom: (1) Elimination tree concurrency decreases when getting closer to the 
root node. (2) Closer to the root of the elimination tree, more HSS tree concurrency is exploited as it becomes available, i.e., 
while moving down the HSS tree away from the root. (3) Towards the root of the HSS tree and the root of the elimination tree, 
more in-node concurrency (parallel tasked dense algebra) is exploited. (4) The product of the 3 types of concurrency, i.e., the 
overall concurrency, remains constant throughout both the elimination and HSS trees. 
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Figure 4: Speedups over sequential getrf from Intel® MKL for matrices of size 500 2 to 4000 2 . Left: Recursive LU decompo¬ 
sition using OpenMP tasked BLAS code. Middle: Reference LAPACK getrf using OpenMP tasked BLAS code. Right: MKL 
optimized multithreaded getrf. Our recursive implementation scales better than the reference netlib getrf with parallel BLAS 
but worse than the MKL optimized code. However, calling MKL multithreaded getrf from multiple threads simultaneously 
would lead to over-subscription and performance penalty. This is not a problem with the recursive LU because it uses the 
OpenMP task runtime, just like the rest of the code. 


The code also requires some LAPACK functionality, namely LQ, LU and RRQR decompositions. For 
those, we modify the reference Fortran LAPACK implementation to make use of our parallel (tasked) BLAS 
routines. Some vendor optimized LAPACK libraries do not just use the LAPACK reference code on top of 
multithreaded BLAS calls, but add additional optimizations to the LAPACK routines. Unfortunately, in our 
approach we cannot take advantage of these optimized multithreaded codes. Consider partial pivoted LU 
decomposition, used for the Fn block of a dense frontal matrix. Apart from the LAPACK *getrf routine 
using our OpenMP tasked BLAS routines, we also implemented a recursive LU factorization algorithm 21 i . 
The parallelism in this algorithm has to come from the BLAS routines, triangular solve, row permutation, 
and matrix-matrix multiply. Figure [4] compares the performance and scalability of the two LU decomposition 
approaches with the MKL optimized implementation and shows our implementation of LU scales nearly as 
well as MKL without sacrificing the ability to exploit subtree concurrency. A more scalable approach ng, 
based on so-called tiled algorithms instead of recursion, partitions the matrices in tiles of fixed sizes and 
assigns tasks to each of the tiles while explicitly modeling the data dependencies between the tasks. A DAG 
(directed acyclic graph) scheduler then executes the tasks while respecting their dependencies. OpenMP 
supports explicit task dependencies since version 4.C0. We intend to exploit this feature in the future to 
achieve more scalable dense linear algebra operations. For the rank-revealing QR decomposition we use a 
modified version of the LAPACK *geqp3 code [45], a BLAS3 version of column pivoted QR. The routine 
is modified to call our parallel tasked BLAS and an extra tolerance parameter e is added to stop the rank- 
revealing process as soon as the £-rank has been found instead of computing the full decomposition. More 
precisely, numerical rank i is detected when Ri + i t i + \/Rn < e, where R is the upper-triangular factor. 

5.4 Scaling bottlenecks 

Before the actual numerical factorization step, but after matrix scaling and nested dissection reordering, 
a symbolic factorization step is performed. During this step some memory is allocated and the index sets 
ju pd are assem bled. The symbolic factorization is a bottom-up tree traversal which is done in parallel, as 
in Listing [1] In a multithreaded setting, memory allocation can become a serious scaling bottleneck. We 
have found that the use of a scalable memory allocator, like TCMalloc m or the TBB scalable memory 

4 Not all compilers currently support the latest OpenMP 4.0 standard. 
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allocator [JBj greatly improves the performance over for instance the default malloc in gli bc@. For instance, 
running on a 60 core Intel® Xeon Phi, the symbolic factorization phase runs up to 56 x faster when using 
TBBMalloc instead of the default allocator. 


6 Numerical experiments 

This section presents various numerical results. Section RTT1 first focuses on some PDE problems on regular 
grids as this allows us to easily change the problem size. The following sections consider other matrices from 
various applications. Unless otherwise stated, the experiments are performed on a single 12-core socket of a 
single node of the NERSC Edison machinc0. A compute node has two 12-core Intel® Ivy Bridge processors at 
2.4GHz. Double precision peak performance is 19.2Gflop/s per core, 230.4Gflop/s per socket or 460.8Gflop/s 
per node. Each socket has 32GB DDR3 1866MHz memory, hence 64GB per node, with a STREAM jib] 
bandwidth of 48.5GB/s. We use the Intel® 15.0.1 compiler with sequential MKL. 

6.1 PDEs on a regular grid 

We start with a number of benchmarks of well-known PDEs on regular 2D and 3D grids to study scaling 
of time-to-solution, number of floating point operations, memory usage, HSS-ranks etc., with respect to 
problem size. For these regular grids, a geometric nested dissection code is used instead of the default 
METIS graph partitioner. The following benchmark problems are considered: 

• Poisson equation —A u = / on a 2D grid using the standard 5-point finite difference stencil with 
homogeneous Dirichlet boundary conditions. 

• Poisson equation on a 3D grid using the standard 7-point stencil with homogeneous Dirichlet boundary 
conditions. 

• Convection diffusion equation [43] —vAu + v ■ Vu = / on a 2D grid using a 5-point upwind stencil, 
with viscosity v = 10” 4 and 

v = (ie(1 -x)(2y~ 1) y(l-y)(2x-l)) T . (21) 

• Convection diffusion, similar as above, on 3D grid with 

v = (2x(l - x)(2y - l)z —t/(l - y){2x - 1) ~(2x - l){2y - l)z(l - z)) T . (22) 

• Helmholtz equation 

(—A — oj 2 /v(x) 2 )u(x,uj) = s(x,iu) (23) 

on a 2D grid, with u> the angular frequency, v(x ) the seismic velocity and u{x,u >) the time-harmonic 
waveheld solution to the forcing term s(x, oj). The discretization uses a 9-point stencil and the frequency 
is set at / = 10Hz with ui = 2nf. We use a sampling rate of about 15 points per wavelength and PML 
boundary conditions. This example uses complex arithmetic. 

• Same as H2D, but 3D using a 27-point stencil. 

A crucial parameter for performance is the number of levels t s of the elimination tree for which HSS 
compression is performed. Note that l s = 0 corresponds to a pure multifrontal solver. Unfortunately, the 
optimal l s is impossible to predict a priori, so it is determined experimentally and will always be mentioned 
with each result. The same applies to the compression tolerance e. When i s > 0, i.e., with HSS compression, 
the multifrontal solver is used as a preconditioner for restarted GMRES(30) with modified Gram-Schmidt 

5 http://www.gnu.org/software/libc/ 

e https://www.nersc.gov/users/computational-systems/edison/ 
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and a zero initial guess. Without HSS compression, iterative refinement with the direct solver is used. All 
experiments are performed in double precision with relative or absolute stopping criteria ||u,;||/||uo|| < lCfo 6 
or llu.ll < 10 _1 °, where iq = M~ 1 {Axi — b ), with M the approximate multifrontal factorization of A, is the 
preconditioned residual. The right-hand-side is always set to A [l 1 • • • l] T . 

Figure [5] shows timing results for the 2D (top) and 3D (bottom) Poisson equation on 5000 2 and 125 3 
grids respectively. Figure [5a] shows numerical factorization time as a function of the number of levels in the 
elimination tree for which HSS compression was used. The HSS levels always correspond to the top levels of 
the elimination tree. This shows that applying HSS compression leads to a speedup of about 2x for 7 HSS 
levels. Different lines correspond to different HSS compression tolerances e. Somewhat larger factorization 
speedups are possible for e > 10~ 4 . However, this does not lead to faster time-to-solution. Figure [5b] shows 
the cumulative time for nested dissection reordering, symbolic factorization, numerical factorization and 
GMRES solve. For e > 10 -4 , the number of GMRES iterations, and thus the number of applications of 
the multifrontal solve, increases too much to get overall speedup. Best results were obtained with t s = 8, 
£ = HR 7 and only 2 GMRES iterations (3 multifrontal solves). Figures PTFI and l5dl show the timings for the 
3D Poisson problem. For the 3D problem, much more aggressive HSS compression can be used. Best results 
were obtained with £ s = 10, £ = 0.9 and 61 GMRES iterations. For the Poisson problem it seems that 
for 2D the direct solver is very efficient, with a modest speedup from HSS, while for 3D the HSS enabled 
factorization leads to a good preconditioner. 

Figure [Ba] shows the total number of flops (numerical factorization and GMRES solve) for solving a 2D 
Poisson equation as function of the number of degrees of freedom, again for different compression tolerances. 
For the pure multifrontal method (no compression), the number of flops is 0(N 3 / 2 ), as predicted by the 
theory. For 2D Poisson the HSS-rank is independent of the grid size El, which leads to an optimal solver, 
i.e., linear scaling in the number of unknowns, see the fit in Figure Ifial Note the much larger constant for 
the HSS method. For the 5000 2 problem there is a reduction in the number of flops by a factor of about 
3.3x. However, the observed speedup (Figure [5b]) is smaller than that. This is due to the fact that although 
the number of flops for the factorization decreases, the number of flops for the solution phase (and GMRES 
iterations) increases. Although multifrontal solve requires an order of magnitude less flops than factorization, 
it runs at much lower flop rates on modern hardware because it is limited by the memory bandwidth instead 
of the floating point unit. Additionally, the flop rate in the factorization phase is lower when using HSS 
compression due to the more fine-grained task decomposition. Figure [6b] shows number of flops-to-solution 
for the 3D Poisson equation. For the very aggressive compression, e = 0.9, the number of floating point 
operations for the 125 3 problem are reduced to 4.4% of the number of flops for the multifrontal method. 
Figure I7al shows the total solve time for the 3D problem for different grid sizes. These times include matrix 
reordering, factorization and GMRES solve. Figure l7bl shows the size of the factors. 

Table [2] shows detailed results for the 6 PDE problems. The best speedups are obtained for the 3D 
problems. The code achieves good performance in flops per second for the factorization phase; although 
slightly less so for the HSS enabled code. Since the performance of the solve phase is not bounded by the 
floating point unit but rather by the memory bandwidth, we report the approximate attained bandwidth. 
The detailed results from Tableware summarized in Figure [SI 

The e and £ s values used for Tableland Figures [5][7] were chosen to minimize the total time to factor and 
solve a single linear system, i.e., the optimal trade-off between factorization time and number of GMRES 
iterations. When multiple consecutive solves with the same matrix are required, one needs to select different 
i s and £ values. For many consecutive and highly accurate solves, the pure (exact) multifrontal factorization 
is probably optimal as it minimizes the number of multifrontal triangular solves. However, suppose only a 
few digits of accuracy are required. The multifrontal HSS solver can be used as a direct solver and due to 
the smaller factor size the solve phase will be faster than a solve with the pure multifrontal code. 

The times for symbolic factorization in Table [2] are larger for the multifrontal method than for the HSS 
solver. This is because the memory for dense frontal matrices is allocated during the symbolic factorization 
while memory for the HSS generators is allocated during the numerical factorization since HSS-ranks are 
not known in advance. 
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(a) Num fact time, 5000 2 . 



(c) Num fact time, 125 3 . 



(b) Total solve time, 5000 2 . 



(d) Total solve time, 125 3 . 


Figure 5: Times for factorization (left) and solve (right) of the 2D (top) and 3D (bottom) Poisson equation on 5000 2 and 
125 3 grids as function of the number of levels i a in the elimination tree for which HSS compression is applied. Different curves 
correspond to different HSS compression tolerances e. For 3D, much more aggressive HSS compression can be used. 
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(a) 2D Poisson. 



Figure 6: Scaling of the number of floating point operations required to factor and solve a 2D (a) or 3D(b) Poisson equation, 
(a) The theory predicts G(N 3 / 2 ) complexity for the multifrontal (MF) solver and optimal O(N) I55| complexity with HSS 
compression, (b) 0(N 2 ) complexity for the multifrontal solver and slightly lower complexity with HSS compression. The fits 
(black lines) are very sensitive to the data and not very reliable. However, note the smaller exponents and the larger constants 
for the new solver. 




N 

(b) 3D Poisson, factor size. 


Figure 7: (a) Total solve time (reordering, factorization and solution) for the 3D 125 3 Poisson equation, (b) Factor size for 
the same problem. Different lines correspond to different HSS compression tolerance e, MF refers to pure multifrontal. The 
HSS enabled solver is faster for larger problems, and it allows to solve larger problems. 
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problem 

P2D 

P3D 

C2D 

C3D 

H2D 

H3D 

grid size 

5000 2 

125 3 

5000 2 

125 3 

4000 2 

100 3 


nested dissection time (s) 

2.5 

0.25 

2.1 

0.24 

2.9 

0.43 


symbolic factorization time (s) 

3.6 

8.0 

3.4 

8.1 

4.5 

6.0 


factorization time (s) 

29.1 

254.7 

28.8 

252.4 

53.6 

259.1 

o3 

factorization flops (xlO 12 ) 

4.9 

50.0 

4.9 

50.0 

10.1 

53.5 

O 

£ 

flop rate (xlO 9 Gflop/s) 

168.4 

196.3 

170.1 

198.1 

188.4 

206.5 

3 

fraction of peak 

73% 

85% 

74% 

86 % 

82% 

90% 

§ 

factor size (GB) 

28.4 

41.6 

28.4 

41.6 

36.3 

35.5 


solution time (s) 

1.4 

1.1 

1.5 

1.1 

2.4 

0.91 


solution flops (xlO 9 ) 

7.9 

10.5 

7.9 

10.5 

21.1 

18.2 


solution bandwidth (GB/s) 

20.3 

37.8 

18.9 

37.8 

15.1 

39.0 


fraction of peak 

42% 

78% 

39% 

78% 

31% 

80% 


total flops (xlO 12 ) 

4.9 

50.0 

4.9 

50.0 

10.1 

53.5 


total time (s) 

36.6 

264.1 

35.8 

261.8 

63.4 

266.4 


nested dissection time (s) 

2.5 

0.26 

2.1 

0.24 

2.9 

0.43 


separator reordering time (s) 

1.1 

0.40 

1.1 

0.35 

1.3 

0.68 

m 

m 

symbolic factorization time (s) 

2.0 

0.55 

2.1 

0.8 

2.9 

1.8 

K 

+ 

factorization time (s) 

14.5 

19.6 

13.6 

41.5 

30.5 

92.8 

factorization flops (xlO 12 ) 

1.5 

2.0 

1.3 

5.0 

3.9 

18.0 


flop rate (x 10 9 Gflop/s) 

103.4 

102.0 

95.6 

120.5 

127.9 

194.0 


fraction of peak 

45% 

44% 

41% 

52% 

56% 

84% 

3 

factor size (GB) 

22.5 

9.8 

21.6 

14.6 

29.6 

21.4 


fraction of multifrontal 

79% 

24% 

76% 

35% 

82% 

60% 


solution time (s) 

4.1 

15.3 

4.3 

75.9 

22.5 

71.2 


GMRES(30) iterations 

3 

67 

3 

234 

11 

152 


solution flops (xlO 9 ) 

25.6 

169.9 

23.7 

876.3 

210.5 

1,759.0 


solution bandwidth (GB/s) 

22.0 

43.6 

20.1 

45.2 

15.8 

46.0 


fraction of peak 

45% 

90% 

41% 

93% 

33% 

95% 


HSS levels t s (total) 

7 (22) 

8 (18) 

8 (22) 

7(18) 

7 (22) 

4 (18) 


HSS-rank 

48 

46 

50 

397 

139 

30 


HSS compression tolerance e 

lO" 6 

0.9 

10 -5 

0.1 

lO" 4 

0.9 


total flops (xlO 12 ) 

1.5 

2.2 

1.3 

5.9 

4.1 

19.8 


fraction of multifrontal 

30% 

4.4% 

27% 

12% 

41% 

37% 


total time (s) 

24.2 

36.1 

23.2 

118.8 

60.1 

166.9 


speedup 

1.52X 

7.14x 

1.54X 

2.22x 

1.05x 

1.59x 


Table 2: Comparison of the standard multifrontal solver and the multifrontal solver with HSS compression for a number of 
PDEs on regular grids. All experiments are run on a 12-core Intel® Ivy Bridge (peak 230.4Gflop/s and 48.5GB/s) in double 
precision. The code achieves good performance in terms of Gflop/s (for the factorization) or GByte/s (for the solve) and HSS 
compression leads to nice speedups over the standard multifrontal solver. A geometric nested dissection code is used for these 
regular grid problems. 


19 



0> 

s 




fa 




X 

+ 

fa 


P2D5000 2 


a 

C2D 5000 2 


a 

H2D4000 2 



PSD 125 3 C3D 125 3 HSD 100 3 


Figure 8 : Summary of the results from Table [2] Poisson (P), convection-diffusion (C) and Helmholtz (H) on 2D (left) and 
3D (right) regular grids on a 12-core Intel® Ivy Bridge. Poisson and convection-diffusion are in double precision, Helmholtz in 
complex double precision. 


6.2 Matrices from various applications 

Figure [9] shows a comparison of timings to solve linear systems with a number of matrices from applications. 
The matrices atmosmodd, Geo_1438, nlpkkt80, torso3, Transport and Serena are front the Uni¬ 
versity of Florida Sparse Matrix Collectior@. The other matrices tdr190k, A22 and SPE10-ANISOTROPIC, 
are from SciDAC projects at the DOE. The matrices are also listed in Table [Sj which additionally contains 
matrix Cube_Coup_dtO. This last matrix is not shown in Figure [9] because our multifrontal code ran 
out of memory during factorization unless HSS compression was used. All selected matrices are relatively 
large and originated from a 2D or 3D partial differential equation (on arbitrary domains). In Figure 0 our 
HSS enabled multifrontal solver (MF+HSS) is compared to the pure multifrontal method (MF) and to the 
state-of-the-art PARDISO solver £J9]. PARDISO, a multithreaded supernodal solver, is part of Intel® MKL. 
For MF and MF+HSS, reorder time includes nested dissection, MC64 and symmetrization of the sparsity 
structure and for MF+HSS also separator reordering. Factor time includes both symbolic and numerical 
factorization. The times are normalized to a total time 1 for MF. For the matrices selected for Figure O we 
see a consistent speedup from MF+HSS compared to pure MF and our MF solver always outperforms the 
PARDISO solver. PARDISO uses the same METIS nested dissection reordering as our implementation, with 
comparable reordering times for the different solvers. The supernodal pivoting scheme used in PARDISO for 
numerical stability does not affect the fill-in so the overall number of nonzeros in the factors with PARDISO 
and with our multifrontal code are very similar. Only for the A22 problem reordering the separator in order 
to reduce the HSS-ranks takes a lot of time. This is probably due to the addition of link-two edges to the 
separator graph (see Section 14.21) since the original matrix already has 246 nonzeros per row on average. 
However, if those extra edges are not taken into account, the HSS-ranks are much larger and there is no net 
performance benefit from using HSS. 

6.3 Many-core parallel performance 

Figure [10] shows performance and parallel scalability of the MF+HSS solver applied to the T0RS03.MTX 
matrix (£ s = 6, e = 0.5) on two leading multi-core architectures: a two sockets machine with a 12-core 

7 http://www.cise.uf1.edu/research/sparse/matrices/ 
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Figure 9: Comparison of timings for matrices from various applications on a 12-core Intel® Ivy Bridge. PARDISO is the 
sparse direct multithreaded solver from Intel® MKL. MF refers to our implementation of the multifrontal method and MF+HSS 
is our new multifrontal solver with HSS compression. The matrices are taken from the Florida Sparse Matrix Collection and 
from SciDAC projects at the DOE. For these matrices, which are all quite large and from 2D/3D PDE problems, our MF solver 
is faster than PARDISO and HSS compression gives an additional speedup. 



MF 

HSS 

matrix 

order 

#nnz 

factor 

solve 

4/fmax 

£ 

rank 

its 

factor 

solve 

atmosmodd 

1.2M 

8 .8M 

81s 

0.4s 

6/18 

0.9 

17 

88 

25s 

lls 

Geo 1438 

1.4M 

63M 

205s 

Is 

6/18 

0.9 

8 

318 

56s 

129s 

nlpkkt80 

1.1M 

29M 

197s 

0.7s 

6/18 

0.5 

59 

90 

49s 

23s 

tdrl90k 

1.1M 

43M 

19s 

0 .2s 

1/18 

Ter 7 

61 

2 

18s 

0.4s 

torso3 

.25M 

4.4M 

6 s 

0.05s 

6/15 

0.5 

36 

7 

5s 

0 .2s 

Transport 

1.6M 

23M 

80s 

0.5s 

3/18 

10 “* 

182 

24 

69s 

10 s 

A22 

.59M 

145M 

127s 

0.7s 

10/17 

0.1 

172 

18 

105s 

3s 

spelO-aniso 

1.2M 

31M 

88 s 

0.4s 

3/18 

Ter 7 

245 

21 

73s 

7.3s 

Serena* 

1.4M 

65M 

171s 

0.5s 

6/18 

0.9 

11 

111 

40s 

22 s 

Cube Coup dt0* 

2.2M 

65M 

- 

- 

8/19 

0.5 

100 

200 

60s 

63s 


*single precision experiment 


Table 3: Same as in Figure |9j comparison of timings for matrices from various applications on a 12-core Intel® Ivy Bridge. 
This table also shows the optimal number of HSS levels i s , the optimal compression tolerance e and the corresponding HSS-rank 
and number of GMRES(30) iterations. For the Serena and Cube_Coup_dt0 matrices the pure multifrontal method ran out of 
memory in double precision. 
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(a) Intel® Ivy Bridge. (b) Intel® Xeon Phi. 


Figure 10: Multi-core scalability of the different steps in the MF+HSS solver on two leading architectures. The MF+HSS 
solver is applied to the relatively small T0RS03.MTX matrix. The code achieves good speedup for the numerical factorization 
phase and reasonable speedup for the solve (MF+HSS preconditioned GMRES). Note that the sequential reordering codes 
METIS and MC64 become bottlenecks. 


Intel® Ivy Bridge Xeon per socket and a 60-core Intel® Xeon Phi Knight’s Corner. When running 12 or less 
threads on the dual socket 24-core Xeon system, the threads are all bound to a single socket (NUMA node). 
Note that since the Xeon Phi only has 8GB of memory, the larger problems from Table [3] do not fit in it’s 
memory. Our code shows good parallel scalability on both architectures for the numerical factorization phase 
and reasonable scalability for the solve phase. However, with increasing number of threads the reordering 
codes MC64 and METIS/SCOTCH quickly become scaling bottlenecks. The MC64 phase in Figure [TUI shows 
some parallel speedup since this time also includes applying the column permutation from MC64, which is 
done in parallel. 


7 Conclusions &; Outlook 

We presented an initial attempt to create a high performance implementation of a novel multifrontal solver 
with HSS low-rank structures. We show speedups of up to 7x over the pure multifrontal algorithm for a 
range of applications. Moreover, our implementation compares favorably to the commercial PARDISO solver. 
We observed that the new solver has lower computational complexity than the pure multifrontal method. 
However, the constants involved are much larger. We will focus our attention on trying to reduce these 
constants (for instance by trying to reduce the HSS-ranks) and on solving larger problems with a distributed 
memory implementation. As possible strategies to reduce the HSS-ranks we consider the following. A power 
iteration on the random vectors, for instance S r = (AA*) q AR with q a small integer, will improve the quality 
of the samples at the expense of additional computations; see [29] for further details. We believe the separator- 
reordering, see Section 14.21 can be improved, perhaps by taking into account the matrix entries and/or the 
underlying geometry, also leading to lower ranks. Finally, a better rank-revealing factorization, like a strong 
rank-revealing QR [23], might lead to lower ranks and possibly more stable ULV factorization. The solver 
with HSS compression achieves lower floating point operation throughput than the pure multifrontal code. 
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Hence, we believe there is some room for improvement. We will continue performance tuning of the code on 
various modern computer architectures. 

The presented code is part of a package called STRUMPACK. At the moment STRUMPACK has a 
sparse shared memory solver and a dense distributed memory solver. The longer term goal is to develop and 
maintain a single scalable code for both sparse and dense problems using hybrid parallelism. The current 
paper, together with the distributed HSS code developed for m are a good step towards reaching that goal. 

The research on fast sparse and dense direct solvers is a very active field at the moment. Some newer algo¬ 
rithmic ideas are for instance nested HSS approximation and matrix-free direct solver-based preconditioners. 
In nested HSS approximation, the HSS generators of the frontal matrices are themselves HSS matrices. This 
could further reduce the overall complexity of the solver. A matrix-free direct solver based preconditioner 
could be constructed using randomization techniques. It seems that knowledge of the sparsity pattern would 
be required for this. 
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